Sentiment Analysis and Prediction of 2016 Elections based on Social Media Posts


Methodology

Reference: KDNuggets

For this group project, our general approach is to compare which methods work best on this dataset, using both the statistical learning methods we learned in class and common machine learning techniques for analyzing large text data (grouped below under the Lexicon-based Approach).

Some important packages that we may use for this project:

tm :                
An R package that is useful for text mining purposes and analyzing corpus objects (such as finding associations, frequencies, etc.)

SentimentAnalysis:  
An R package that is popular in sentiment analyses. Has built-in dictionaries (Harvard-IV dictionary, Henry’s Financial dictionary, and Loughran-McDonald Financial dictionary) that list each word with its associated sentiment scores.

syuzhet, tidytext: 
Supplementary R packages for sentiment analysis.

caret:            
A popular Machine Learning R package that enables the use of common supervised and unsupervised methods.

MASS:            
An R package that enables the use of several classic statistical approaches, such as ridge (penalized) regression and discriminant analysis.

stringr:          
An R package that provides Regular Expression (regex) support, enabling consistent manipulation of character objects.

plotly, ggplot2:  
R packages for visualization.

magrittr:
An R package for pipelining.

Shiny:
An R package for interactive data exploration.

Of course, there are some common R packages that we will or may be using, like dplyr, purrr, or readxl, that are not mentioned in this list.

Lexicon-based Approach

In sentiment analysis, the lexicon-based approach can be seen as a “nonparametric” statistical approach to meta- or unstructured data. Traditionally, the Dictionary-based Approach is seen as the “easier” method in comparison to the Corpus-based Approach. The former uses a predefined list of “sentiment scores” that English dictionaries assign to sentiment-bearing words such as “angry” (with a score of -3) and “happy” (with a score of 3), after first removing common stop words from the text data. The scores are then typically aggregated before classifying the text as “negative”, “neutral”, or “positive”.

## Joining, by = "word"
[Output of the word.count chunk not rendered]

Code Reference: Silge & Robinson, tidytext

The approach above uses the *Dictionary-based Approach* just described to analyze the word counts of six different Jane Austen novels.
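The exact chunk is not shown, but a minimal reconstruction along the lines of Silge & Robinson might look like the following. It assumes the janeaustenr corpus and the AFINN lexicon (fetched via the textdata package on first use); names like `word.count` follow the output above.

```r
# Sketch (not the exact code used above): dictionary-based scoring with
# tidytext on the Jane Austen novels, following Silge & Robinson.
library(dplyr)
library(tidytext)
library(janeaustenr)

word.count <- austen_books() %>%
  unnest_tokens(word, text) %>%                        # one row per word
  anti_join(stop_words, by = "word") %>%               # drop common stop words
  inner_join(get_sentiments("afinn"), by = "word") %>% # AFINN scores, -5..5
  group_by(book) %>%
  summarise(sentiment = sum(value), n = n())

word.count
```

Specifying `by = "word"` explicitly suppresses the `## Joining, by = "word"` message seen above; leaving it out reproduces it.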

On the other hand, the Corpus-based Approach investigates the polarity of the text using ML methods. This method relies on the context or domain, which can inform the sentiment labels of our text data.

## (Intercept)           x 
##   3.4810675  -0.5440241

Code Reference: Embry, tm

A Corpus-based method (from the tm package) is shown above. Zipf’s Law states that the frequency of a word is inversely proportional to its rank in the frequency table. The following two fits also follow the same law.

## (Intercept)           x 
##    2.824209   -0.478659
## (Intercept)           x 
##    5.212469   -0.728547

Code Reference: Embry, tm
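The slope/intercept pairs above come from log-log fits of term frequency against rank. A sketch of such a fit with tm, assuming a corpus object `docs` already exists, might be:

```r
# Sketch of a Zipf's-law check, assuming a tm corpus `docs` is in scope.
library(tm)

dtm  <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Zipf: log(frequency) ~ a - b * log(rank), with slope -b near -1
fit <- lm(log(freq) ~ log(seq_along(freq)))
coef(fit)  # compare with the (Intercept) / x estimates reported above
```

tm also ships a convenience function, `Zipf_plot(dtm)`, that draws the log-log plot and fit directly.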

Here, from the context of the corpus data, we could attempt to assign the sentiment labels.

Statistical Learning Approach

The statistical learning approach concerns assigning and verifying labels (along with other useful ideas like dimensionality reduction, which could reduce computation time when dealing with large datasets such as ours). Our concern in this analysis is how to assign sentiment labels nonparametrically without the help of a labeled response.

Our first step might be to reduce the sheer number of dimensions of the text column by performing a PCA. This will reduce computation time and mitigate the ‘curse of dimensionality’ by keeping only the most informative dimensions. A biplot might be useful for this:

Code Reference: Stackoverflow
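A hypothetical sketch of this step, assuming the tweets have already been encoded as a binary document-term matrix `X` (rows = tweets, columns = 0/1 word indicators):

```r
# PCA on a binary document-term matrix `X`, then a biplot.
pca <- prcomp(X, scale. = FALSE)   # scaling 0/1 columns is optional
summary(pca)$importance[3, 1:5]    # cumulative variance of the first 5 PCs
biplot(pca, cex = 0.6)             # inspect scores and loadings together
```

Columns with zero variance (words appearing in every tweet or none) should be dropped beforehand if scaling is turned on, since `scale. = TRUE` fails on constant columns.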

One challenge here is that, because we will encode the text variables as columns of binary values, computing dissimilarity matrices may be difficult. After choosing the dimensions to keep, we will apply some common unsupervised and supervised learning methods to analyze the polarity of the texts. For that, we will be using the methods that we learned in class.

Predicting Election Results

After assigning a label to each tweet, we will predict the 2016 election results by first determining the frequencies of the sentiment labels by electoral district. Here we may run into some difficulties and potential bias, which we explain in the next section of our proposal.
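The aggregation step could be sketched as follows, assuming a hypothetical data frame `tweets` with columns `district` and `label` (taking values "negative", "neutral", "positive"); the column names are placeholders, not from the dataset:

```r
# Tally sentiment labels per district and take the dominant one.
library(dplyr)

district_sentiment <- tweets %>%
  count(district, label) %>%            # frequency of each label per district
  group_by(district) %>%
  mutate(share = n / sum(n)) %>%        # within-district label shares
  summarise(predicted = label[which.max(share)])
```

How a dominant sentiment label maps to a predicted vote is itself a modeling choice we will need to justify.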


Challenges

For brevity and simplicity, we will present the possible challenges when undertaking this project in bullet points.

  1. The dataset

Our dataset has 34 columns and 397,629 rows, with text being the feature of most interest. Data cleaning may therefore be a problem, especially since the many stop words in the data may render some observations useless. Upon further analysis, we found that only 338,331 tweets are in English, and we have not yet determined which observations originate from the United States (many users did not enable geolocation). Also, the dataset covers only the date 11/08/16, which may introduce some bias and inaccuracy into the prediction. Since our ultimate goal is to predict the election results and only United States citizens can vote, these issues may prevent our group from predicting them accurately, which will probably show up later in our confusion matrix. In terms of depth of analysis, however, our group should still be in relatively good shape.

  2. Data visualization

Since we will be using a dataset that has many dimensions, visualizing it might be a problem. Due to that, our group decided to use the package plotly (whenever feasible), since it can render interactive 3D plots.

Code Reference: plotly
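As a minimal illustration of the 3D feature, one could plot tweets in the space of their first three principal components; `pca` here is an assumed `prcomp` result, not output from our analysis:

```r
# 3D scatter of hypothetical PCA scores with plotly.
library(plotly)

scores <- as.data.frame(pca$x[, 1:3])  # PC1, PC2, PC3 columns
plot_ly(scores, x = ~PC1, y = ~PC2, z = ~PC3,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2))
```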

Also, to give our data visualizations a crisper and more modern look, we will be using plotly and Shiny for better data exploration.

  3. Data merging

Using this dataset alone to predict the election results may not be realistic. After all, more often than not, sentiment alone falls short as an argument. Due to that, we may use some packages, like rvest or twitteR, to web-scrape further needed information such as manifestos, state laws, and the sizes of constituencies. This will take a long time.
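A scraping step with rvest might look like the sketch below; the URL and CSS selector are placeholders for illustration only, not real sources we have chosen:

```r
# Illustrative rvest sketch; URL and selector are hypothetical.
library(rvest)

page <- read_html("https://example.com/state-election-laws")
rows <- page %>%
  html_elements("table tr") %>%  # pull table rows from the page
  html_text2()                   # collapse each row to clean text
```

Rate limits and the Twitter API's terms of use will also constrain how much supplementary data we can realistically collect.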